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ABSTRACT 

We revisit a recently introduced power spectrum estimation technique based on Gibbs sampling, 
with the goal of applying it to the high-resolution WMAP data. In order to facilitate this analysis, 
a number of sophistications have to be introduced, each of which is discussed in detail. We have 
implemented two independent versions of the algorithm to cross-check the computer codes, and to 
verify that a particular solution to any given problem does not affect the scientific results. We then 
apply these programs to simulated data with known properties at intermediate (iV s id e = 128) and high 
(-YsidG = 512) resolutions, to study effects such as incomplete sky coverage and white vs. correlated 
noise. From these simulations we also establish the Markov chain correlation length as a function of 
signal-to-noise ratio, and give a few comments on the properties of the correlation matrices involved. 
Parallelization issues are also discussed, with emphasis on real-world limitations imposed by current 
super-computer facilities. The scientific results from the analysis of the first-year WMAP data are 
presented in a companion letter. 

Subject headings: cosmic microwave background — cosmology: observations — methods: numerical 



1. INTRODUCTION 

The subject of cosmic microwave background (CMB) 
power spectrum estimation has been a very active 
research field for many years now, and the com- 
bined effort from the scientific community has re- 
sulted in a number of qualitatively different meth- 
ods. Broadly, one may classify these methods into 
three groups , namely maximum likelihood methods 
llG6rskilll994l lL996t FLegmarldll997t iBond Jaffe fe Knoxl 
ll998HOh et al] | 1999HDore. Knox fc Pee]H20f)lft. pseudo- 
Cf m ethods (Wandc hi IHivon fc Gorskit iHivon et al. 
2002t lHansen. Gorski fc Hivonl l2002t lHansen fc Gorski 
2QQ3) , and specialized methods Ivan Leeuwe n et al. 
20n2HChallinor et al.l2rlollWandelt fc Hansenl200.^ . In 
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general the maximum likelihood methods are more ac- 
curate than the (often Monte Carlo based) pseudo-Cf 
methods, but they are usually so at a prohibitive com- 
putational cost. And even these methods can only return 
very approximate summaries of the error bars, since ex- 
ploring the likelihood away from the peak is practically 
impossible. Further, the specialized methods are usually 
only applicable following rather restrictive assumptions. 
For general experiments we have up to now been left 
with the rather uncomfortable choice between the op- 
timal but prohibitively expensive, and the feasible but 
approximate. 

In this context, a method based on Monte Carlo 
Markov Chains and Gibbs sampling was very recently 
developed by Jewell, Levin & Anderson (2004) and Wan- 
delt, Larson & Lakshminarayanan (2004) which may 
change this picture. The fundamental idea behind this 
method is to solve the power spectrum estimation prob- 
lem by establishing its posterior probability distribution 
through sampling, rather than by direct solution of the 
corresponding optimization problem. In its most gen- 
eral form, the scaling of this Monte Carlo method equals 
that of the map making process, which is to be com- 
pared to the typical (D(Np ix ) scaling for traditional maxi- 
mum likelihood estimators (Borrill 1999), iV p i x being the 
number of pixels in the map. Further, for an experi- 
ment with spherically symmetric beams and uncorrelated 
noise, one may work directly with maps instead of time- 
ordered data, in which case the scaling reduces to that of 

a spherical harmonics transform, namely O(N^), using 
the HEALPix 10 pixclization. 

Until now, the only application of this method to cos- 
mological data was the analysis of the low-resolutio n 
COBE-DMR data, presented bv iWandelt et all {2004). 
In the following we demonstrate the practicality of this 
method for current and future experiments, as we for 

10 http:/ /www. eso.org/science/healpix/ 
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the first time apply it to a large data set, namely the 
Wilkinson Microwave Anisotropy Probe ( WMAP) data 
ijBennett et al.ll2003a|) . This data set consists of eight 
cosmologically important frequency bands, each with 
about three million pixels, and it is therefore an excel- 
lent test bed for any new algorithm. We have devel- 
oped two independent implementations of the algorithm, 
one called Commander 11 ("Commander is an Optimal 
Monte-carlo Markov chAiN Driven EstimatoR" ) and the 
other called MA GIC 12 ( "Magic Allows Global Inference 
of Covariance") ijWandeltl 12003). We have tested these 
implementations extensively, and found that they pro- 
duce statistically identical results. 

The goals of the present paper are twofold. First, we 
prepare for the actual WMAP analysis by developing a 
number of sophistications to the Gibbs sampling algo- 
rithms necessary to facilitate a high-resolution analysis. 
Second, we apply the computer codes to simulated data 
in order to verify that the codes perform as expected, 
and to build up an intuitive understanding of issues such 
as the Markov chain correlation length vs. the signal- 
to-noise ratio, multipole coupling vs. sky coverage, and 
sufficient sampling vs. overall CPU time. A proper un- 
derstanding of these questions is crucial in order to opti- 
mize real- world analyses. The scientific results from the 
WM AP analys i s are reported in a companion letter by 
IQ'Dwver et all p)0l . 

2. ALGORITHMS 

This paper is a natur al ext ension of the work pre sented 
by iJewell et ail i|2004fl and iWandelt et"aD (|20^0j), and 
we will in the following frequently refer to those papers. 
Further, we do not attempt to re-establish the motiva- 
tion behind the Gibbs sampling approach here, but re- 
fer the interested reader to those papers for details and 
proofs. In the present paper we simply summarize the 
operational steps of the algorithm, and specialize the dis- 
cussion to the problems encountered when analyzing the 
first-year WMAP data. 

We now define some notation. The data are given in 
the form of N sky maps (also called "bands" or "chan- 
nels") 

d fc = A fc s + n fc , (1) 

where k = 1, . . . , N runs over the bands, is the vector 
of observed pixel values on the sky, A& is the matrix cor- 
responding to beam convolution, s is the true sky vector, 
and rifc is instrumental noise. 

As mentioned above, our main scientific goal of the cur- 
rent work is to analyze the first-year WMAP data, and 
for that rea son we assume th e beam to be azimuthally 
symmetric l|Page et alJ I2003D . The beam convolution 
A may therefore be computed in harmonic space by a 
straightforward multiplication of the corresponding Leg- 
endre components b\. In order to simplify the notation, 
we incorporate the pixel window function into b\. 

Further, for the WMAP data, it is reasonable to ap- 
proximate the noise as uncorrelated, but non-uniform, 
so that the real space noise covariance matrix can be 
written as N^jj = o~f k 6ij, where a^k is the noise stan- 
dard deviation of the ith pixel of the kth sky map. In 



11 Implemented by H. K. Eriksen and J. B. Jewell. 

12 Implemented by I. J. O'Dwyer, D. L. Larson and B. D. Wan- 



fact, we explicitly demonstrate the validity of this as- 
sumption in section 14.21 by first analyzing simulations 
including white noise and then correlated noise, showing 
that they are statistically consistent for the levels of cor- 
related noise present in the WMAP data. Finally, we as- 
sume the CMB fluctuations to be Gaussian and isotropic, 
and the signal covariance matrix therefore simplifies con- 
siderably, Ci m ,t m > = Ci5 u > 6 mm >. 

2.1. Basic Gibbs sampling 

The idea behind the Gibbs sampling power spectrum 
estimation technique is to draw samples from the prob- 
ability density P(Ce |d). The properties of this density 
can then be summarized in terms of any preferred statis- 
tic, such as its multivariate mean or mode. However, one 
of the major strengths of the Gibbs sampling approach 
is that it allows for a global, optimal analysis, and it 
should therefore not be considered as yet another max- 
imum likelihood technique, although it certainly is able 
to produce such an estimate. 

While direct sampling from the probability density 
P(Ce\d) is difficult, it is in fact possible to sample from 
the joint density P(C^, s|d), and then marginalize over 
the signal s. This is feasible because the theory of Gibbs 
sampling tells us that if it is possible to sample from the 
conditional densities P(s|CV,d) and P(Ci\s, d), then the 
two following equations will, after an initial burn-in pe- 
riod, converge to being samples from the joint density 
P(Q,s|d): 

P( S \Cj,d), (2) 

mis i+1 )- (3) 

Thus, given some initial power spectrum and the data, 
we may iterate these two relations, discard the first few 
pre-convergence samples (if necessary), and then use the 
remaining samples to construct whatever statistic we pre- 
fer for the power spectrum. One further advantage of this 
approach is that we probe the joint distribution, and we 
may therefore quantify joint uncertainties. And, in the 
process, we also obtain a Wiener filtered map which may 
be useful for other studies. 

Drawing a power spectrum C\ +1 given a sky map s 4 is 
trivial (see, e.g., Wandelt et al. 2004). Given the power 
spectrum of the signal map (often written on the form 

°l = Em=-£ K 

variates 



a 



i+l 



2 ), one draws 21— 1 Gaussian random 
•■J with zero mean and unit variance, and form 
Y^jLi 1 \Pi \ 2 ■ The desired power spectrum 
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sample is then given by 
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(4) 



On the other hand, drawing a sky map s l given the 
data and an assumed power spectrum, is certainly not 
trivial. Again, we refer the interested reader to the 
above-mentioned papers for justification of the follow- 
ing procedure, and here we only review the operational 
steps. 

The map sampling process is performed in two steps, 
the first being to solve the following equation for the so- 
called mean field map x, 



dclt. 



N 



Lfc=l 



N 



x = ^AT N ,T 1 4. (5) 



k=l 



Fig. 1. — Examples of the maps produced in one step of the Gibbs sampler. Top panel: The full-sky, noise-less Gibbs sample, s. Middle 
panel: The mean field (Wiener-filtered) map, x. Bottom panel: The fluctuation map, y. 



Here r| = is the residual signal map. The reason 
for introducing this notation will become clearer when 
additional components are introduced into the sampling 
chain. Any new component we may wish to include in 
the analysis will simply be subtracted from the data, to 
form an actual residual map from which the mean field 
map is computed. 

The mean field map is a generalized Wiener filtered 
map, and as such is biased. To construct an unbiased 
sample one must therefore add a fluctuation map y with 
properties such that the sum of the two fields is a sample 
from the distribution of the correct mean and covariance. 



The appropriate equation for this fluctuation map is 
/ f N 



cr 1 



N 



(6) 



C-V»a, + 5>£N 
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fc=l 



where Wk are Gaussian white noise maps of zero mean 
and unit variance. 

Examples of such maps are shown in figure ^ and the 
corresponding power spectra are shown in figure[5] How- 
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Fig. 2. — Spectra corresponding to the maps in figure HI The 
red lines shows the spectrum of the Wiener-filtered map, and is a 
biased estimate of the underlying spectrum. Therefore, the Gibbs 
sampler adds a fluctuation term to the Wiener-filtered map, to 
yield an unbiased estimate of the true spectrum. 



ever, in practice the two equations are solved simultane- 
ously by solving for the sum of x and y, in order to reduce 
the total CPU time. 

Finally, we point out that even though the Gibbs sam- 
pling technique is a Bayesian method, a frequentist view 
may be taken by choosing a uniform prior. In that case, 
the procedure reduces to simply exploring the joint like- 
lihood, and frequentist concepts such as the maximum 
likelihood estimate may be established. 

2.2. Consistent treatment of mono- and dipole 
contributions 

One of the most elegant features of this formalism 
is its ability to incorporate virtually any real- world 
complication, as disc ussed by IJewell et all ll2004|) and 
iWandelt et alJ l|2004|) . A few examples of this flexibil- 
ity are applications to 1// noise, asymmetric beams, 
non-cosmological foregrounds, or arbitrary sky coverage. 
However, in this paper we include only the effects of the 
mono- and dipole contributions (which may be thought 
of as foregrounds) and that of partial sky coverage, given 
that our main scientific goal is to analyze the fairly well- 
behaved WMAP data. 

The question regarding mono- and dipole contribu- 
tions has gained renewed importance during the previ- 
ous year, given the very active debate concerning the 
quadrupole seen in the WMAP data. This quadrupole 
appears to be small compar e d to t he best-fit cosmo- 
logical model llSpereel et all 12003 lEfstathioul I2003at 
Ide Oliveira-Costa et alJl2004|) . and several authors have 
considered what this may imply in terms of new physics. 
However, the exact significance of this anomaly is dif- 
ficult to assess for several reasons, but mainly because 
of uncertainties in t he foreground subtra ction process 
ijFriksen et a,l.ll2T)0i ISloTa.r Selia.klEiol . Methodol- 
ogy issues for estimating the lowe st multipole ampli - 
tudes have also been pointed out (Efstathiou 2003b). 
Strongly related to both these issues is the fact that non- 
cosmological mono- and dipole contributions may couple 
into the other low-order modes through incomplete sky 
coverage. 



The most common way of handling this latter prob- 
lem is to fit a mono- and dipole to the incomplete sky, 
including internal coupling caused by the sky cut, and 
then simply subtract the resulting best-fit components 
from the data. However, this procedure neglects the 
noise correlations that are introduced by removing any 
fitted templates. The Gibbs sampling framework allows 
a statistically more consistent approach: rather than di- 
rectly subtracting the fitted mono- and dipoles from the 
data, one may marginalize over them through sampling, 
and thus recognize the inherent uncertainties involved. 

As always in Bayesian analyses, one has to choose a 
prior, and the most natural choice in this case is a uni- 
form prior. This corresponds to saying that we do not 
know anything about these components. For analytic 
computations and proofs, however, it is more convenient 
to define this as a Gaussian with infinite variance, which 
is just a different way of parameterizing a uniform prior. 
It should be noted that a uniform prior does not mean 
that these components are unrestricted, but, rather, it 
simply means that their values are determined by the 
data alone. 

Again, general formalisms for hand ling this type of 
problem were descr ibed by IJewell et al.l l|2004|) and 
IWandelt et aD l)2004j) . and we will only repeat the opera- 
tional steps here, in a notation suitable for our purposes. 
Let us first define a Ap; x x 4 template matrix T con- 
taining the four real spherical harmonics in pixel space, 



where Yp 



T — (Y o, Yi_i, Yio, Yn) , 

(Yi m (6q , 0i ) , . . . , Yt m (0]V pix , 07V pix )) ' 
= 1/V47T 



YooO 

Yi_i(0,<£) = V3/47T sine? sin 
iio(( 
Y n (i 



= \/3/4tt cost? 
= \/3/4ir sin 9 cos < 



(7) 
and 

(8) 
(9) 
(10) 
(11) 



Note that T is a projection matrix onto the subspace 
spanned by the corresponding templates. 

Next we define a vector of template amplitudes = 
(oq , a\_ l , a\ Q , ai 1 ) T , letting the amplitudes be different 
for each channel, since we have no reason to assume that 
these components are frequency independent. Thus, the 
mono- and dipole contribution to the fcth channel is = 

We now want to sample from the conditional distribu- 
tion P(w/j|dfc, s), and this is done (assuming the infinite 
variance prior) by solving the following equation, 

(T T N^T) w fc = T T N* ^ + S k , (12) 

d k - 



where the mono- and dipole residual map is r 
AfcS, and 



md 

fe 
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1 00 
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1/2 , ,(1) 
1/2 , ,(2) 
1/2 , ,(3) 
1/2 , 14) 



(13) 



(i) 

Here, ui k are white noise maps of vanishing mean and 
unit variance. 

The next step in traditional Gibbs sampling would now 
be to sample from the conditional density P(p m d|wfc), 
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where p m d are the parameters of the probability distri- 
bution describing the mono- and dipoles. However, since 
we have chosen a very special prior, namely one with in- 
finite variance, this distribution does not change, and no 
sampling is required. 

Including the mono- and dipole components in the 
Gibbs sampling chain, it now reads 

w^ +1 ^P(w fc |d fc ,s 4 ), (14) 

B *+ 1 <-P(s|C{,d ) w*+ 1 ) J (15) 

C\ +l «- P{d\s t+1 ). (16) 

The first step is computed as described in the previous 
paragraphs, and the second step is computed by equation 
with the slight modification that the mono- and dipole 
contributions now are subtracted from the data, r| = 
d fc - Tw fc . 

2.3. Incomplete sky coverage 

Perhaps the single most important complication in any 
CMB analysis is proper treatment of foregrounds. With 
amplitudes up to several thousand times the CMB ampli- 
tude, Galactic foregrounds will necessarily compromise 
any cosmological result unless corrected and accounted 
for. Unfortunately, there is currently a critical shortage 
of robust component separation (or even just foreground 
removal) methods, and the only reliable approach at the 
time of writing is simply to mask out the most contam- 
inated regions of the sky. On the bright side, the flexi- 
bility in specifying foreground models that can be imple- 
mented in the Gibbs sampling approach offers an attrac- 
tive avenue for progress. This will be explored further in 
future publications. 

The Gibbs sampling approach supports two fundamen- 
tally different methods for removing parts of the sky by 
means of a mask. First, the most straightforward option 
from a conceptual point of view is simply to set the in- 
verse noise matrix to zero at all pixels within the mask. 
This corresponds to saying that the noise level of these 
pixels is infinite, and therefore that the data are com- 
pletely non-informative. No other modifications of the 
equations are necessary. This is the solution chosen for 
the Commander implementation. 

However, this approach carries a considerable cost 
in the form of a poorly conditioned coefficient matrix 
A = C _1 +A T N _1 A, which, as we will discuss at greater 
length in the next section, results in slow convergence for 
the conjugate gradient algorithm, and increased over- 
all expense for the Gibbs sampling. Recognizing this 
fact, an alternative approach was chosen for the MAGIC 
implementation, namely to introduce a new foreground 
component into the Gibbs sampling chain. 

Let us recall the g eneral sampling equa tion for the fore- 
ground component (Wandclt ct al. 2004), 

(F^ + A£N^A fc )f fc = 

- aTiv-Vs ^f" 1 / 2 , ,w A T i\ri ,(2) ^ ' 
- A fe iX /t v k +t k w k + A fc JN / £ w fc • 

Here Ffc is the covariance matrix for the foreground prior, 
r^ s is the residual map after removal of the signal esti- 
mate and any other foregrounds already sampled by the 

algorithm, and are vectors of uniform Gaussian vari- 
ates. Finally, ffc is the unconvolved foreground sample. 



For each pixel in the masked region, mainly the Galaxy 
but also some point sources, we do not know the fore- 
ground contribution. The maximally uninformative fore- 
ground prior for these pixels has infinite variance. It cor- 
responds to a complete lack of a priori knowledge of the 
foregrounds in the mask. By specifying maximal igno- 
rance of the foreground we allow the algorithm to de- 
termine the level of the foreground in these pixels which 
is supported by the data. Substituting this foreground 
prior into equation 1171 creates a method to numerically 
marginalize over the unknown foreground contribution in 
the masked pixels. 

In the limit of 'infinite' variance, this sampling equa- 
tion simplifies to 

A fc f fe = Nl /2 uj k + if (18) 

in the masked region and ffe — outside. This is easy to 
compute and avoids the use of the Conjugate Gradient 
solver, hence saving computational time. 
With the introduction of this foreground component, 



the full Gibbs chain reads 

f* +1 ^P(f fe |d fe ,s*,w^), (19) 

w* 1 ^P(w Jk |d fc ,s < ,fj +1 ), (20) 

s I + 1 ^P(s|Q,d,f^ +1 ,w^ 1 ), (21) 

C\ +1 «- P(C e \s t+1 ). (22) 



Again, the only modifications in the two middle steps 
is a subtraction of the foreground components from the 
corresponding residual maps, r% — d^ — Tw^ — A^ffc and 

it d = d fc -A fe (s + f fe ). 

As mentioned earlier, the main advantage of this ap- 
proach is that the uniform properties of the coefficient 
matrix A are conserved, leading to a faster convergence 
for the conjugate gradient solver, often reducing the num- 
ber of iterations by a factor of three. On the other hand, 
there is also a slight disadvantage in that the correla- 
tions between consecutive Gibbs samples are stronger, 
since information is carried over from sample to sam- 
ple through the foreground component. However, this is 
more than compensated by the rapid CG convergence. 
We will return to these issues later. 

3. COMPUTATIONAL CONSIDERATIONS 
3.1. Conjugate gradients and preconditioning 

As described in section l2~Tl equations and are the 

very heart of the Gibbs sampling method, and its feasibil- 
ity is directly connected to our ability to solve those equa- 
tions. For a low-resolution experiment such as COBE- 
DMR, which comprises a few thousand pixels or mul- 
tipole components, the system may be solved directly, 
for instance through Cholesky decomposition. However, 
for a high-resolution experiment such as WMAP, with 
eight cosmologically important maps of each several mil- 
lions pixels, more sophisticated algorithms must be em- 
ployed, and the most efficient method currently available 
for positive-definite matrices is the Conjugate Gradient 
(CG) method l)Golub & van Loanlll996D. For a truly ex- 
cellent review of this algorithm, see iShewchukl l)1994|) . 

The general problem is to solve a system of linear equa- 
tions, 

Ax = b, (23) 
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(c) Kp2 sky cut, £-based ordering 



(d) Kp2 sky cut, m-based ordering 



Fig. 3. — The coefficient matrices A= 1 + C 1 / 2 [^^_ 1 A^N^ Aj-JC 1 / 2 , summed over all eight WMAP channels, and using the power 
spectrum estimated by the WMAP team. All elements up to £ max = 59 are included, a choice determined by plotting constraints only. The 
upper panels plot the matrix when the full sky is available, and the lower panels plot it when the Kp2 mask is applied. The elements arc 
ordered by £-major (with pixel index i given by i = £ 2 + i + m + 1) in the left column, and by m-major in the right column, (m increases 
as m = 0, —1, 1, —2, 2, . . . from left to right, in steps of £ max — \m\. Within each m-block t = |m|, \m\ + 1, . . . , f max .) A solid black color 
indicates a signal-to-noise ratio larger than 5. 



where the coefficient matrix A is very large. In the case 
of the first-year WMAP data, A corresponds to a system 
of three million equations in pixel space, and a system of 
several hundred thousands equations in harmonic space 



(depending on the £ max of choice; see section EL 1.21 for a 
discussion on how to choose an appropriate £ max ). Fur- 
ther, this coefficient matrix is in general not sparse in ei- 
ther real space due to complicated signal correlations, or 
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in harmonic space due to complicated noise correlations. 
However, favorable sparsity patterns may be obtained for 
special scanning strategie s and sky cuts l|Oh et al.lll999t 
iWandelt fc HanseiJl2003|) . 

For this reason, the sheer size of the problem poses a 
real problem, and for many applications one may find 
that the system described above is ill-conditioned. For 
instance, the solution vector x contains elements of very 
different magnitudes, and therefore round-off errors can 
easily compromise the results. It is therefore numerically 
advantageous to rewrite equations |5] and [B] as follows, 

N 



G l/2 



E A " N * 



G l/2 



,fc=l 



N 



G l/2 



-1/2, 



N 



(24) 



= C 1 /^ A jN- 1 rI 



k=l 



E A * N ^ A * 



G l/2 



(25) 



LOq 



-1/2 



UJk- 



fc=l 



The coefficient matrix A = 1 + C 1 / 2 A T N- 1 A T C 1 / 2 is 
now much better behaved, and all elements of the solu- 
tion vector C _1 / 2 x have unity variance. Note also that 
the diagonal elements of A are now simply the signal-to- 
noise ratios of the corresponding mode. 

We choose to work in harmonic space in the following 
for several reasons. First, in this space it is easy to limit 
the size of the problem according to the signal-to-noise 
ratio of the data by choosing an appropriate £ max . In 
pixel space one is always forced to work with vectors of 
length iVpi x . Second, given the form of equations and 
El two spherical harmonics transforms are eliminated by 
operating in harmonic space in the first place, thereby 
reducing the total CPU time by a factor of two. Finally, 
since we are mainly interested in the power spectrum, 
an harmonic space based convergence criterion for the 
CG search seems more natural than a pixel space based 
criterion. 

One of the main advantages of the CG algorithm is 
that it does not require inversion of the coefficient ma- 
trix, and we do not even need to store it. All we need 
is the ability to multiply A with a given vector v, and 
solving a preconditioning equation. Wc first consider the 
matrix multiplication operation. In our setting, for which 
A=l + C 1 / 2 A T N" 1 A T C 1 / 2 , this is done in a step- wise 
fashion. First we multiply each component o>i m of the 
input vector by \fC~ibi (where be is the product of the 
beam and pixel window functions), and then we perform 
an inverse spherical harmonic transform into real space. 
Here we multiply with the inverse noise matrix, -/V b s /<TQ, 
under the assumption of uncorrelated noise. Then we 
perform an ordinary spherical harmonic transform of the 
vector into harmonic space, where we again multiply with 
the beam and square root of the power spectrum. Finally 
we add the original vector. Thus, multiplication of A 
is computationally equivalent to two spherical harmonic 
transforms, and memory requirements are virtually neg- 
ligible 13 . 

13 If accessible memory is sufficient on the available computer, 



The efficiency of the CG algorithm is highly dependent 
on our ability to construct a good preconditioner (e.g., 
Oh et al. 1999), and two preconditioners have been pro- 
posed for this problem so far, both approximating A -1 
in harmonic space. First, under the assumption of white, 
but non-uniform, noise, the inverse real-space noise co- 
variance matrix may be written as a simple inverse noise 
rms map, N~ 1 (8, </>), which again may be expanded into 
spherical harmonics, 
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The inverse noise matrix in spherical harmonic space is 
then ijHivon et alJ l2002) 
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A very simple preconditioner may therefore be defined 
in terms of the diagonal elements only, 
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While satisfactory for the simplest applications, we find 
that it takes about 300 iterations to solve for the WMAP 
data consisting of all eight cosmologically interesting 
bands with this preconditioner (applying the Kp2 mask 
directly) , making the total solution of the problem very 
expensive. 

By conside ring the o verall structure of the inverse 
noise matrix, lOh et all l|1999Tl proposed to use a block- 
diagonal matrix. In the limit of perfect azimuthal sym- 
metry of both the galactic cut and the noise distribu- 
tion, N _1 is orthogonal with respect to to, and there- 
fore it makes sense to also include all elements having 
l\ 7^ £2 , fni = WI2 up to some arbitrary limit m ma i. At 
higher to's, the diagonal preconditioner is used. lOh et all 
( 1999) claims to achieve convergence in six iterations 
with this preconditioner for properties corresponding to 
the two-year WMAP data, but, unfortunately, we have 
not yet been able to reproduce this performance. From 
our experiments it seems the combination of a highly 
non-symmetric Kp2 cut, 700 resolved point source cuts, 
and a noise distribution tilted with respect to the galactic 
plane introduces significant couplings between different 
to's. 

In figureOlwe have plotted the coefficient matrices cor- 
responding to the first-year WMAP data in two different 
orderings, both ^-major and TO-major (see caption for 
details), and with and without application of the Kp2 
mask. In the limit of uniform noise and no galactic cut, 
these matrices would all be diagonal, and convergence 

one may want to precompute the associated Legendre polynomials, 
reducing the total CPU time typically by a factor of two or three 
for WMAP type maps in the current HEALPix implementation. 
The memory requirement for doing so is 8 N s i^ c ^ bytes, or on 
the order of 1GB for iV 8ide ,&nax ~ 512, N sidc being the HEALPix 
resolution parameter, which corresponds directly to the number of 
pixels in the map through the relation iV p i x = 12 A r ^ dc . 



would be reached in one single CG iteration using even 
the diagonal preconditioner. 

However, as seen in the top two panels of figure |3| 
adding non-uniform noise to the problem introduces sig- 
nificant coupling between different modes, which again 
leads to poorer CG performance. In the left panel, we 
see that the largest absolute values are found at low £'s, 
which of course is not very surprising, considering that 
these matrices are a measure of the signal-to-noise ra- 
tio. In the right panel we see the same matrix organized 
as m-major, and in the limit of azimuthal symmetry, 
this would be a strictly block-diagonal matrix with very 
small block elem ents. The preconditioner proposed by 
lOh et alJ (|1909f) consists of the inverses of those blocks. 
But, as we see, there are many off-diagonal elements in 
this matrix, and, indeed, the dominant elements actually 
seem to be components for which \m\ — ra^\ = 1- How- 
ever, if we had computed these quantities in the ecliptic 
frame, rather than in the galactic, then the matrix is 
likely to be dominated by the m\ = ni2 elements, and 
possibly even by the m = elements. 

The bottom two panels show a similar set of matrices, 
but in this case the Kp2 mask has been applied to the 
sky. And, as mentioned in section l2~3l this has the highly 
undesirable effect of magnifying the off-diagonal elements 
through mode-to-mode coupling considerably. Unfortu- 
nately, neither of these matrices have a very dominant 
symmetry structure, and it is therefore difficult to estab- 
lish an optimal preconditioner. 

Nevertheless, based on the structures seen in the lower 
left panel in figure [3] a third alternative was chosen for 
the Commander implementation. Rather than includ- 
ing only the diagon al elements, or only mi = mi ele- 
ments as IQh et alJ lEHI do, we include all elements 
up to some arbitrary £ 

max. 

(typically I 

max w 50-70 for 

WMAP), and at higher Ps we include only the diagonal 
elements. The required memory requirements for this 
matrix scales as C(^ ax ), an d are thus quite expensive, 
but in practice, the real limitation is the CPU time re- 
quired for its Cholesky decomposition (which scales as 
C[£^ ax ]) rather than memory requirements for its stor- 
age. For £ ma . x = 50, the memory requirements are 52 
MB and the CPU time for Cholesky decomposition is 
on the order of one or two minutes. Obviously, the lat- 
ter number must be compared to the CPU time it takes 
to perform one CG iteration and the number of itera- 
tions saved. And yet, even with this rather expensive 
preconditioner, we find that the CG search converges in 
about 60 iterations for the combined first-year WMAP 
data and a norm-based fractional convergence criterion 
of I0~ 6 . Thus, our performan ce is not a s impr essive as 
the six iterations achieved bv IQh et alJ (|1999f) . Work 
on this issue is still on-going, and a hybrid of all three 
variants may prove to be the ultimate solution. 

In contrast to the Commander implementation, 
MAGIC does not apply a sky cut directly, but instead it 
introduces a new random field into the sampling chain. 
The appropriate coefficient matrix is therefore the one 
shown in the upper left panel of figure^ This choice has 
a very positive effect in terms of CG performance, and 
one routinely achieves convergence within 20 iterations 
using just the simple diagonal preconditioner for a first- 
year WMAP type experiment. However, as we will see 
later, the cost for this performance comes in the form of 
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Fig. 4. — A few selected histograms of the power spectrum 
samples produced in the low-resolution analysis. The black his- 
tograms are generated by Commander, and the red histograms are 
generated by MAGIC; the agreement is striking, demonstrating 
that both codes work as expected. The vertical solid lines indicate 
the theoretical input spectrum, and the dashed lines indicate the 
realization specific spectrum. The agreement between the peak po- 
sition of the histograms and the dashed lines is excellent at low Pb. 
At high £'s, however, the distributions are completely dispersed, 
reflecting the noise domination in this regime. 



a slightly longer correlation length in the Markov chain, 
and therefore fewer independent samples. 

3.2. Parallelization 

The main limitation for the Gibbs sampling method 
is CPU time. Even though the scaling of the method is 
equivalent to that of a spherical harmonics transform for 
a WMAP type analysis, one has to perform this opera- 
tion many times, and the total prefactor of the algorithm 
is therefore large. Specifically, the number of spherical 
harmonic transforms to produce one Gibbs sample is two 
times the number of CG iterations, times the number of 
frequency bands. The total number of transforms for 
computing one sample from an eight-band WMAP data 
set is then typically on the order of 1000 for the Com- 
mander approach (reaching convergence in 60 iterations) 
and 350 for the MAGIC approach (reaching convergence 
in 20 iterations). Knowing that one harmonic transform 
takes about 5 seconds for A si d c = 512 and £ max = 512, 
the total CPU time required for one single Gibbs sample 
is therefore on the order of one or two hours for Com- 
mander and half an hour for MAGIC. Obviously, paral- 
lelization is essential to produce a sufficient number of 
samples. 

Two fundamentally different approaches may be taken 
in this respect. Either one may choose to run one sin- 
gle Markov chain and parallelize the spherical harmonic 
transforms internally. Since the HEALPix routines oper- 
ate on pixel rings of constant latitude, this can be done 
quite efficiently by letting each processor compute its 
own ring. Nevertheless, optimal speed-up will not be 
achieved, and the implementation will be somewhat com- 
plicated. 

The other approach is to take advantage of the fact 
that this method is truly a Monte Carlo method, and 
one can therefore let each processor run its own Markov 
chain. The most important advantages of this approach 
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Fig. 5. — Power spectrum results from the low-resolution simulations. The blue line indicates the theoretical spectrum from which a 
random Gaussian realization was drawn, and the red curve is the power spectrum of that particular realization. The black curve shows 
the marginalized estimates of this spectrum produced by Commander, while the gray bands indicate the 1 and 2a confidence bands. The 
information in the two panels is the same, but for different ranges in I. The dark gray vertical region in the right panel indicates where 
the signal-to-noise ratio is approximately unity. 



are optimal speed-up and the possibility to initialize each 
chain with a different first guess. As we will see in the 
next section, consecutive Gibbs samples in the Markov 
chain are highly correlated in the low signal-to-noise 
regime, and producing a larger number of independent 
samples is therefore quite expensive. If we have a rough 
approximation of the true spectrum and its uncertainties 
(as we usually do, through a MASTER type analysis; 
Hivon et al. 2002), we can partially remedy this problem 
by initializing each Markov chain with an independent 
power spectrum. 

The major drawback of this latter parallelization 
scheme, however, is that each Markov chain will necessar- 
ily be quite short, perhaps only twenty to fifty samples. 
This problem is due to the fact that most current super- 
computer facilities have a maximum wall-clock time limit 
of 24 to 72 hours, and therefore the maximum length of 
one chain is on the same order of magnitude. Of course, 
one may store intermediate results and restart the com- 
putations after every cycle, but this only increases the 
total length by a factor of a few, not by hundreds. 

We have chosen a combination of external and internal 
parallelization in our implementation, by recognizing the 
fact that we will in general be analyzing multi-frequency 
data sets consisting of iVb an d maps. We may therefore let 
A^band processors work on the same Markov chain, each 
processor transforming one band. Thus, optimal speed- 
up is not compromised, while the length of the chains 
is increased by the same factor. In future versions we 
will also implement fully internal parallelization in the 
HEALPix routines map2alm and alm2map, to have the 
option of focusing all the computational resources into 
one single chain. 

4. SIMULATIONS 

In this section we apply the Commander and MAGIC 
codes to simulated data sets for which all components 
are perfectly known. The goals are two-fold. Firstly, we 



wish to demonstrate that the codes produce results con- 
sistent with theoretical expectations, and secondly, we 
seek to gain insight on what limitations of the algorithm 
we can expect to meet in real-world applications, when 
CPU time is limited. 

A number of different simulations are analyzed in the 
following sections, each designed to highlight some spe- 
cific feature. First, in order to establish the asymp- 
totic behavior of the algorithm, we study a data set of 
smaller size than the full-resolution WMAP data. Specif- 
ically, we construct a data set at intermediate resolution 
(N sidc = 128; 196 608 pixels), for which the CPU time 
per sample is on the order of 10 seconds. Thus, CPU 
time is not a dominating problem, and we can establish 
the Markov chain correlation lengths and power spec- 
trum correlation matrix to great accuracy. The burn-in 
time is also considered. 

Finally, we make two simulations at full WMAP reso- 
lution in order to confirm that the overall results from the 
low-resolution analysis carry naturally over to higher res- 
olutions. This time the CPU cost is the limiting factor, 
and the main goal of this section is in fact to demonstrate 
that the Gibbs sampling method is able to handle even 
large data sets, such as the WMAP data. This analysis 
mimics th e analysis of the first -year WMAP data pre- 
sented bv lO'Dwver et alJ (|2004f) . in that it is run on a 
super-computer with many short, parallel chains. The 
only difference between the two runs is that either white 
or correlated noise are added to the CMB simulations. 
This way we test whether the assumption of white noise 
may compromise the scientific results in the presence of 
small, but non-negligible, noise correlations. We find 
that this is not a significant problem for the first-year 
WMAP data. 

4.1. Low-resolution simulations 

The main goal of the low-resolution simulations is to 
study the asymptotic behavior of the method when the 
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number of independent samples is very high. On the 
one hand, this allows us to verify that the codes work 
as expected without worrying about errors introduced 
because of a limited number of samples, and on the other, 
essential quantities such as the Markov chain correlation 
length and the power spectrum correlation matrix may 
be established to a high degree of accuracy. 

In order to facilitate such long-chain analyses, we study 
maps with relatively low resolution, N s i^ e = 128, but 
with properties corresponding to a consistently down- 
scaled WMAP-type experiment. Specifically, we gener- 
ate a CMB sky from the best-fit WMAP running index 
spectrum, and convolve this sky with modified version 
of the WMAP beams. The beams are made four times 
wider by replacing their original Legendre transform with 
b e - b l r rcs = 1/4 Er=oW- 

The noise components are generated by degrading the 
original WMAP noise rms maps 14 to iV S jd e = 128 by 
simple averaging over pixels in the HEALPix nested or- 
ganization. Thus, the noise per low-resolution pixel in 
our simulated maps is about the same as that for each 
high-resolution pixel in the full-sized WMAP data. The 
signal-to-noise ratio is therefore downscaled to the ap- 
propriate resolution, a fact which will be important when 
studying the relationship between the correlation length 
and the signal-to-noise ratio. 

We also want to study the effect of residual mono- 
and dipoles on the cosmological power spectrum, and we 
therefore add a random mono- and dipole contribution 
with an artificially large amplitude (on the order of tens 
to a hundred mK) to the signal plus noise map. The 
reconstructed values are then later compared with the 
exact input values. 

Finally, we generate a degraded mask to match this 
res olution, based on the WMAP Kp2 mask as defined 
bv lBennett et all l)2003b|) . This mask is downgraded to 
Aside = 128 by requiring that all high-resolution sub- 
pixel within a N a id e = 128 pixel (again, in the HEALPix 
nested organization) are included by the original mask. 
Thus, this mask is very slightly expanded compared to 
the actual Kp2 mask. 

4.1.1. Verification of algorithms and codes 

In the first test, we apply the Commander and MAGIC 
codes to one single band from the data set described 
above, namely to the VI band. Commander was run for 
100 000 samples, while MAGIC was run for 4000, with 
the main goal of comparing the codes, verifying that they 
produce identical output. 

The results from this exercise are shown in figure 
The black probability densities show the Commander re- 
sults, while the red histograms show the MAGIC results. 
The agreement is striking, and this is a strong confirma- 
tion that the codes work as expected, and that the minor 
differences in implementational details discussed earlier 
do not affect the scientific results. 

The dashed lines show the true, underlying CMB 
power spectrum value, which should theoretically coin- 
cide with the peaks of the histograms, in the limit of full 



14 The noise rms maps are defined by Oi (p) = <j j /-J A r * bs (p) , 
where Oq is the average sensitivity of the various bands, and 
N\ (p) is the number of observations for each pixel p. 



sky coverage and no noise. At low £'s, we see that this is 
indeed the case. Here it is also worth recalling that we 
added artificially large mono- and dipole components to 
the simulations (several orders of magnitudes larger than 
what is realistic), and this does still not compromise the 
results. 

In figure \5\ we have plotted the full spectrum com- 
puted from the 100000 sample run. The input spec- 
trum is marked in red, the ensemble-averaged spectrum 
in blue, and the maximum likelihood solution from the 
Gibbs sampler in black. The gray bands indicate 1 and 
2a confidence regions for the power spectrum. 

At low ^'s the discrepancy between the estimated and 
input spectra is primarily due to the galactic cut, while 
at high fs it is primarily due to noise. In particular, we 
see that the maximum likelihood estimate actually drops 
to zero for many of the low signal-to- noise ratio modes, 
which, again, is the expected behavior for a maximum 
likelihood estimator in the noise-dominated regime. 

In figure El we plot two different correlation matrices, 
each on the form 

'C t -(C t )C t .-(C t ,)\ 
y/VaxCe VVarCV /' 
The averages are taken over the 100000 samples in the 
Markov chain described above. The left panel shows 
the correlation matrix of the sampled power spectra, 
which are basically uncorrelated by construction, while 
the right panel shows the correlation matrix of the power 
spectra, ae, computed from the sampled maps, s. The 
latter matrix is related to the correlation matrix of the 
maximum likelihood power spectrum found by maximiz- 
ing the posterior, and mainly describes mode-mode cou- 
pling due to the cut sky on these scales. 

Finally, in figure we plot the distributions of the 
mono- and dipole samples and compare them to the in- 
put values, marked by dashed, vertical lines. Although 
there certainly is a discrepancy between the distribu- 
tion modes and the input values, the overall rms values 
are very small, on the order of 5-10 /xK, and consistent 
with the fluctuation level expected for a single realiza- 
tion. Further, there is some coupling between the mono- 
and dipole modes due to the galactic cut, which could 
be important. However, since our sole interest in these 
components lies in removing them, rather than estimat- 
ing them, this is not an important problem for our pur- 
poses. In fact, given the very small impact of these very 
large mono- and dipole components, we feel confident 
that the cosmological \ow-£ spectrum is not compromised 
by mono- and dipole issues. 

4.1.2. Convergence and correlations 

We now turn to the issues of convergence, correlation 
length and burn-in time, all of which must be thoroughly 
understood in order to design and optimize a real- world 
analysis properly. The problem can be plainly stated as 
follows: How many Gibbs samples do we need to estimate 
the power spectrum with sufficient accuracy in order to 
be limited by non-algorithmic issues? As we will see, the 
answer depends intimately on which angular scales we 
wish to consider, a conclusion which is most easily seen 
by going back to the Gibbs sampling scheme. 

The algorithm works as follows: First we assume some 
arbitrary (but hopefully reasonable) power spectrum, 
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Real- valued multipole amplitude, a, (uK) 

Fig. 7. — Distributions of mono- and dipole samples. The dashed, 
vertical lines show the true input value, and the histograms show 
the sampled values. The observed shift between the mode of the 
distributions and the input values is most likely due to internal 
couplings between the components. The most important result, 
however, is that the scatter is very small, with typical rms values 
smaller than 10 /iK, even for the unrealistically large input values 
used in this experiment. 



and compute a Wiener-filtered map based on that spec- 
trum. Then we add a fluctuation term which replaces 
the power lost both to noise and to the galactic cut. The 
sum of those two terms mimics a full-sky, noiseless map 
with a power spectrum determined by the data in the 
high signal-to-noise regime, and by the assumed power 
spectrum in the low signal-to-noise regime. From this 
full-sky power spectrum we then draw a new spectrum, 
which subsequently is taken as the input spectrum for 
the next Gibbs iteration. 

The crucial point is that the random step size in the fi- 
nal stage is determined by the cosmic variance alone. Our 
goal is to probe the full probability distribution which 
includes both noise and cosmic variance. In the high 



signal-to-noise regime, the difference does not matter. 
Sequential Gibbs samples are therefore for all practical 
purposes uncorrelated. The opposite is true in the low 
signal-to-noise regime: since the distance between the 
two samples is determined by the cosmic variance, while 
the full distribution is dominated by the much larger 
noise variance, two sequential samples will be strongly 
correlated. 

This problem is a severe limitation for the Gibbs sam- 
pling technique in its current formulation. It makes it 
very expensive to probe the low signal-to-noise regime 
completely. The Gibbs sampling technique is only a spe- 
cial case of the more general Metropolis-Hastings frame- 
work. Other sampling schemes may be devised which 
break the correlation between neighboring samples. This 
will be the topic of a future publication, and for now our 
main goal is to quantify this effect, rather than eliminate 
or minimize it. 

We take advantage of the low-resolution simulations 
in order to quantify these correlations. Specifically, we 
consider the power spectrum values at constant £ in the 
Markov chain as independent functions, and study the 
correlations in these chains as a function of £. The statis- 
tic we choose for this study is a simple auto-correlation 
function, 



C(n) 



Ci-(C e )Ci +n -(C e ) 
VVar Ut VVar C e 



(30) 



Here n is the distance in the chain measured in number 
of iterations. Such functions are plotted in figure 



8 a 



for 

six different £'s, computed from a new Commander chain 
consisting of 3800 samples, including all eight bands. 

As expected, the correlations become stronger as I in- 
creases, or, equivalently, as the signal-to-noise ratio de- 
creases. In this particular case, the signal-to-noise ratio 
is unity at approximately £ = 85, and therefore the spec- 
trum is limited by cosmic variance at smaller fs. This 
translates into a very short correlation length for £ = 50 
in this case, and consequently into a high efficiency in 
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(a) (b) (c) 

Fig. 8. — The relationship between the signal-to-noise ratio and the correlation length of the Markov chains, a) The correlation function 
of the Markov chain, computed for a few selected multipoles from a Commander chain. Note how the correlations are stronger when the 
signal-to-noise ratio decreases, b) The typical correlation length as a function of signal-to-noise ratio. The typical correlation length is 
defined as the distance for which the correlation functions in the left panel drop below 0.1. c) The ratio of the MAGIC correlation length 
to the Commander correlation length, as a function of multipole. 



terms of independent samples. On the other extreme, 
the correlation length at £ = 140 is very, very long, and 
with only 3800 samples in the chain, we have only a very 
few independent samples from which to form our power 
spectrum estimate. 

We can take this exercise one step further and define 
a typical correlation scale for each £, by computing the 
scale at whic h the correlation function drops under, say, 
0.1. In figure 8(b) we have plotted this correlation length 
directly as a function of the signal-to-noise ratio, and 
from this plot there seems to be a well-defined relation- 
ship between these two quantities. In fact, we will use 
this relation to estimate how many samples we need in 
the actual WMAP analysis later on. For now we note 
that with 3800 samples, as in the above case, we have 
about 200 independent samples at a signal-to-noise ratio 
of 0.6, which corresponds to I rs 105. In other words, it 
would be rather optimistic to believe in the power spec- 
trum based on these samples at £'s higher than, say, 110. 

Another lesson to be learned from these plots is that 
the correlation length increases very rapidly with £, once 
entering the low signal-to-noise regime. This is an im- 
portant point to realize when desiging a new analysis: 
probing the low signal-to-noise regime with the current 
implementation of the Gibbs sampling algorithm is ex- 
tremely expensive. It may therefore often be desirable to 
limit £ max to the £ corresponding to a signal-to-noise ra- 
tio of, say, 0.5 or 0.25. The saved CPU time 15 may then 
be spent on producing more independent samples in the 
high and intermediate signal-to-noise regimes. However, 
truncating the system this way does modify the global 
solution, and care must therefore be taken with respect 
to the highest £'s. In general, the larger the sky cut, the 
more high-i? modes will have to be discared from the fi- 
nal power spectrum, since mode-mode couplings spread 
the sharp ^-space cut-off into a wide range of multipoles. 
In practice, it is convenient to pre-define some range of 
ts of interest, and then increase £ max until that range 
becomes stable. 



In figure 8(c) we have plotted the ratio of the MAGIC 
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Fig. 9. — Burn-in time of the Gibbs sampler, computed from 
the low-resolution simulations. The initial guess was chosen to be 
three times the exact spectrum, and the sampler was run for 80 
iterations. Note how slowly the chain converges toward the correct 
region (i.e., toward the horizontal zero-axis) in the low signal-to- 
noise regime. A good initial guess is essential for the parallclization 
scheme proposed in this paper. 



correlation length to the Commander correlation length, 
and here it is seen, as noted earlier, that the MAGIC cor- 
relation length is typically a factor of 1.5-2 longer at low 
£'s, resulting in a smaller number of independent samples 
of the same factor. Of course, this is both caused and 
made up by the fact that MAGIC handles the incom- 
plete sky coverage differently than Commander. Since 
MAGIC obtains convergence in the CG search roughly 
three times faster than Commander (using a very crude 
preconditioner), the codes do perform quite similarly in 
terms of total CPU time per independent sample. 

Finally, we turn to the issue of burn-in time. Although 
the theory of Gibbs sampling guarantees us that the sam- 
ples will converge toward being samples from the joint 
distribution density, it does not tell us when such conver- 
gence is obtained, and this must therefore be established 
by experiments. We study this issue through a simple 
exercise: Once again we utilize the simulated data de- 
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Fig. 10. — Results from high-resolution simulations. The red lines show the true power spectrum of the input realization, the blue lines 
show the ensemble averaged spectrum from which the CMB sky was drawn, and the black lines show the maximum likelihood spectra 
estimated by the Gibbs sampler. The gray bands indicate 1 and 2cr confidence regions. The spectra are computed from the combined 
Q+V+W simulations, taking into account individual beam and noise properties, and adding either white noise (left panel) or correlated 
noise (right panel) to the simulations. 



scribed above, but this time we choose a first power spec- 
trum guess which is exactly three times larger than the 
true spectrum. Then we run the algorithms for a num- 
ber of iterations, and plot the power spectrum samples 
as a function of iteration count, i. The results from this 
exercise are shown in figure [51 in the form of 



Note that the spectra have been averaged with a window 
width of A£ = 10, making it easier to see the overall 
trends. 

The conclusion to be drawn from this plot seems clear: 
A poor initial guess can invalidate a large number of 
samples, and, in particular, a weak estimate of the low 
signal-to-noise regime is very expensive to correct. This 
can potentially pose a serious threat to our main paral- 
lelization scheme, which is based on many independent 
short chains, rather than one long chain. For this rea- 
son, the Gibbs sampling approach in its current form 
may not be particularly well suited as the only estimator 
for a new experim ent. A faster method, such as Master 
ijHivon et al.11200 2?). is therefore suggested to provide an 
initial guess for the Gibbs samplers. Once an approxi- 
mate power spectrum is established, the Gibbs sampling 
process is already within the appropriate range, and only 
a few samples need to be discarded, if any at all. How- 
ever, we do not need to rely blindly on the first guess, 
since a poorly chosen starting point would lead to a sys- 
tematic drift in the Gibbs chains which should be easily 
detectable. 

4.2. High-resolution simulations 

In this section we turn to high-resolution simulations, 
and undertake a full-scale WMAP-type analysis. The 
simulations in this case are prepared in the same way 
as in the low-resolution case, except with full-scale input 
data, and no inclusion of mono- and dipole components. 

The main limitation in this case is CPU time, and ex- 
tremely long chains are simply not feasible. Instead, we 



run many independent chains in parallel, each producing 
only a small number of samples, as discussed in section 

El 

The analysis is designed to match the analysis of 
the fir st-year WMAP data presented bv lO'Dwver et all 
( 2004). Specifically, we generate a random sky with the 
HEALPix utility synfast, and convolve this sky with the 
beams corresponding to each of the eight WMAP bands 
(Ql-2, Vl-2, Wl-4). Next we add either white noise 
(with the appropriate iV b s patterns for each band) or 
correlated noise (as generated by the WMAP team 16 ) 
to these CMB maps. Finally, the WMAP Kp2 mask 
l)Bennett, et alJl2003btl which excludes point sources is 
imposed on the data, leaving 85% of the sky available 
for analysis. At this stage, the Gibbs sampler is run over 
12 independently initialized chains for 60 iterations, for 
a total of 720 samples. 

We point out that the numbers of observations per 
pixel, A^ bs, in the correlated noise files supplied by the 
WMAP team do not match perfectly those of the ob- 
served map files, and unless the appropriate iV b s pat- 
terns are used in each case, a noise excess at I > 350 is 
observed. The white noise level, however, are identical 
for the two patterns, and so this difference does not have 
a significant impact on the results, as long as one is aware 
of the difference. 

In figure 1101 we have plotted the power spectra from 
the multiple-chain analysis, including white noise in the 
left panel and correlated noise in the right panel. Over- 
all, we see that the agreement between the realization 
specific spectrum (red line) and the maximum likelihood 
solution (black line) found by the Gibbs sampler is quite 
good, and there is no detectable bias in any parts of the 
spectrum. 

Next, in figure ^2 we plot a few selected histograms 
of the power spectra, comparing the white (black his- 
tograms) and correlated (red histograms) noise results 
more directly. As we see, the agreement is generally very 

16 Available at http://lambda.gsfc.nasa.gov 
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Power spectrum, C ; 1(1+1) / 2% (10 3 |iK 2 ) 

Fig. 11. — The Cg distributions of a few selected multipoles, com- 
paring the white noise (black histograms) to the correlated noise 
(red histograms) samples. The true realization specific spectrum 
is marked with a dashed vertical line, and the ensemble averaged 
spectrum with a solid line. 



good, and in particular, the three upper panels clearly 
demonstrate that the low level of correlated noise present 
in the WMAP data do not compromise the \ow-£ spec- 
trum. 

At higher £'s, a small shift may be seen between the two 
distributions, which is most likely due to the fact that the 
noise realizations are different. We made similar plots for 
neighboring multipoles, finding that the absolute levels of 
discrepancy seen in figure 1111 are quite typical for these 
angular scales, and the signs of the shifts are random. 
Thus, the differences does not seem to be indicative of a 
systematic bias. 

By studying simulated data, we have thus explicitly 
demonstrated that the Gibbs sampling technique is able 
to analyze the mega-pixel WMAP data set properly, and 
that neither correlated noise nor incomplete sky coverage 
compromise the scientific results significantly. All in all, 
the feasibility of this approach with respect to current 
and future data sets has been firmly established. 

5. CONCLUSIONS 

We have implemented two independent versions of 
the Gi bbs s ampling technique int roduced bv lJewell et al.l 
l|2004j) and lWandelt et all $2004) . and tested the perfor- 
mance and behavior of the codes thoroughly. In particu- 
lar, we have explicitly verified that the two implementa- 
tions produce identical output, despite a few algorithmic 
differences, demonstrating that these algorithmic choices 
do not affect the scientific results. Further, we applied 
the codes to simulated data with controlled properties, 
and found the output to agree very well with the theoret- 
ical expectations. In doing so, we also demonstrated the 
feasibility of the method for high-resolution applications. 

One of the main goals of these simulations was to build 
up intuition about the phenomenological behavior of the 
Gibbs sampling algorithm, focusing in particular on is- 
sues such as the correlation length of the Markov chains, 
and the burn-in and convergence time. Through these 
experiments, we found that the signal-to-noise ratio is 
by far the most constraining factor to the algorithm in 
its current form. The step size between two consecu- 



tive samples is determined by the cosmic variance alone, 
while the overall posterior density incorporates noise un- 
certainty as well. Thus, in the low signal-to-noise regime 
subsequent samples are highly correlated, and the effec- 
tive number of independent samples is dramatically re- 
duced. In its current formulation, the method is therefore 
most efficient at scales for which the signal-to-noise ratio 
is higher than, say, 0.5, or perhaps up to i — 350-400 
for the first-year WMAP data. On the other hand, the 
Gibbs sampler is only a special case from a more general 
framework, and other sampling schemes may be intro- 
duced in order to break these correlations. This will be 
the topic of a future publication. 

Perhaps the single most appealing feature of the Gibbs 
sampling approach, is its ability to incorporate most real- 
world complications in a statistically consistent manner. 
In this paper we have demonstrated how to handle in- 
complete sky coverage and unknown mono- and dipole 
contributions, which are the most important point for 
the analysis of the first-year WMAP data, but future 
extensions will also include polarization, more sophisti- 
cated treatment of foregrounds, internal sampling over 
cosmological parameters, inclusion of asymmetric beams, 
and statistically consistent handling of 1 // noise. In fact, 
the Gibbs sampling approach is not simply a maximum 
likelihood method, but rather a machinery facilitating 
an optimal, global analysis. Needless to say, the compu- 
tational challenges are considerable, but with a scaling 
equivalent to that of map making (which has to be per- 
formed in any approach currently proposed), this method 
may just be able to do the job. 

A second goal of this paper was to prepare for the ac- 
tual analysis of the WMAP data, by applying the algo- 
rithm to simulated data with similar properties. Specif- 
ically, we showed that the estimated power spectrum 
is unbiased, and that even the lowest-order multipoles 
are not compromised by the either the galactic cut, 
giv en that the foreground correction method presented 
by iBennett et alJ l)2003aj) is adequate, or by the (low 
levels of) correlated noise present in the data. Thus, no 
further sophistications beyond those presented in this pa- 
per seem necessary in order to perform a valid Bayesian 
analysis of the first-year WMAP. The scientific results 
from this analy s is are presented in a companion letter by 
lO'Dwver et ail 1I2OOI . 
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